Load the necessary libraries

This report uses quite a few, so make sure you have them installed before running!
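The library chunk itself doesn't survive in this export; based on the attached packages listed in sessionInfo() at the end of the report, the loading block would look something like:

```r
library(tidyverse)       # dplyr, ggplot2, tidyr, readr, ...
library(lubridate)       # date handling
library(jsonlite)        # reading the population JSON
library(sf)              # geospatial data (borough boundaries)
library(leaflet)         # interactive maps
library(leaflet.extras)  # heatmap layer
library(caret)           # model training and the confusion matrix
```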

The Data

This dataset lists every shooting incident that occurred in NYC from 2006 through the end of 2022. It is intended for public use and allows the public to explore the nature of the shooting incidents.

This report breaks down and explores the shooting incidents by NYC borough. Included are a view of total incidents by borough, an examination of incidents over time by borough, and a heatmap identifying the physical locations of incidents.

The goal is to find out which borough has the most shooting incidents and which of the boroughs might be the “most dangerous.”

Step 1. Load the data

In addition to the CSV file containing the incident data, both geospatial data and population data are required.

# Load the dataset
data_url <- 'https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD'
data <- read.csv(data_url, stringsAsFactors = FALSE)

# Load geospatial data
boroughs <- st_read("https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON")
## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 5 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS 84
# Load population data from the provided JSON link
population_data <- fromJSON("https://data.cityofnewyork.us/resource/xywu-7bv9.json")

Step 2. Data cleaning and transformation

This step involves checking the data and changing the type where necessary.

Here OCCUR_DATE is changed to a Date type.

Additionally, rows with missing data are removed. Because there are relatively few NAs in the columns this report relies on, removing those rows is the simplest solution with low impact.

Finally, centroids are calculated for the boroughs; these are used only to place the borough names on the heatmap later.

# Inspect the first few rows
head(data)
##   INCIDENT_KEY OCCUR_DATE OCCUR_TIME     BORO LOC_OF_OCCUR_DESC PRECINCT
## 1    228798151 05/27/2021   21:30:00   QUEENS                        105
## 2    137471050 06/27/2014   17:40:00    BRONX                         40
## 3    147998800 11/21/2015   03:56:00   QUEENS                        108
## 4    146837977 10/09/2015   18:30:00    BRONX                         44
## 5     58921844 02/19/2009   22:58:00    BRONX                         47
## 6    219559682 10/21/2020   21:36:00 BROOKLYN                         81
##   JURISDICTION_CODE LOC_CLASSFCTN_DESC LOCATION_DESC STATISTICAL_MURDER_FLAG
## 1                 0                                                    false
## 2                 0                                                    false
## 3                 0                                                     true
## 4                 0                                                    false
## 5                 0                                                     true
## 6                 0                                                     true
##   PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP VIC_SEX       VIC_RACE
## 1                                           18-24       M          BLACK
## 2                                           18-24       M          BLACK
## 3                                           25-44       M          WHITE
## 4                                             <18       M WHITE HISPANIC
## 5          25-44        M     BLACK         45-64       M          BLACK
## 6                                           25-44       M          BLACK
##   X_COORD_CD Y_COORD_CD Latitude Longitude
## 1    1058925   180924.0 40.66296 -73.73084
## 2    1005028   234516.0 40.81035 -73.92494
## 3    1007668   209836.5 40.74261 -73.91549
## 4    1006537   244511.1 40.83778 -73.91946
## 5    1024922   262189.4 40.88624 -73.85291
## 6    1004234   186461.7 40.67846 -73.92795
##                                         Lon_Lat
## 1 POINT (-73.73083868899994 40.662964620000025)
## 2  POINT (-73.92494232599995 40.81035186300006)
## 3  POINT (-73.91549174199997 40.74260663300004)
## 4  POINT (-73.91945661499994 40.83778200300003)
## 5  POINT (-73.85290950899997 40.88623791800006)
## 6 POINT (-73.92795224099996 40.678456718000064)
str(data)
## 'data.frame':    27312 obs. of  21 variables:
##  $ INCIDENT_KEY           : int  228798151 137471050 147998800 146837977 58921844 219559682 85295722 71662474 83002139 86437261 ...
##  $ OCCUR_DATE             : chr  "05/27/2021" "06/27/2014" "11/21/2015" "10/09/2015" ...
##  $ OCCUR_TIME             : chr  "21:30:00" "17:40:00" "03:56:00" "18:30:00" ...
##  $ BORO                   : chr  "QUEENS" "BRONX" "QUEENS" "BRONX" ...
##  $ LOC_OF_OCCUR_DESC      : chr  "" "" "" "" ...
##  $ PRECINCT               : int  105 40 108 44 47 81 114 81 105 101 ...
##  $ JURISDICTION_CODE      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LOC_CLASSFCTN_DESC     : chr  "" "" "" "" ...
##  $ LOCATION_DESC          : chr  "" "" "" "" ...
##  $ STATISTICAL_MURDER_FLAG: chr  "false" "false" "true" "false" ...
##  $ PERP_AGE_GROUP         : chr  "" "" "" "" ...
##  $ PERP_SEX               : chr  "" "" "" "" ...
##  $ PERP_RACE              : chr  "" "" "" "" ...
##  $ VIC_AGE_GROUP          : chr  "18-24" "18-24" "25-44" "<18" ...
##  $ VIC_SEX                : chr  "M" "M" "M" "M" ...
##  $ VIC_RACE               : chr  "BLACK" "BLACK" "WHITE" "WHITE HISPANIC" ...
##  $ X_COORD_CD             : num  1058925 1005028 1007668 1006537 1024922 ...
##  $ Y_COORD_CD             : num  180924 234516 209837 244511 262189 ...
##  $ Latitude               : num  40.7 40.8 40.7 40.8 40.9 ...
##  $ Longitude              : num  -73.7 -73.9 -73.9 -73.9 -73.9 ...
##  $ Lon_Lat                : chr  "POINT (-73.73083868899994 40.662964620000025)" "POINT (-73.92494232599995 40.81035186300006)" "POINT (-73.91549174199997 40.74260663300004)" "POINT (-73.91945661499994 40.83778200300003)" ...
# Convert OCCUR_DATE to Date type
data$OCCUR_DATE <- as.Date(data$OCCUR_DATE, format="%m/%d/%Y")

# Extract Year-Month from OCCUR_DATE
data$YearMonth <- format(data$OCCUR_DATE, "%Y-%m")

# Remove rows with missing coordinates
data <- data[!is.na(data$Latitude) & !is.na(data$Longitude), ]

# Clean the population data
population_data <- population_data %>%
  mutate(BORO = toupper(trimws(borough)),
         Population_2020 = as.numeric(`_2020`)) %>%
  select(BORO, Population_2020)

#Calculate the Centroid of Each Borough
boroughs_centroids <- st_centroid(boroughs)

Step 3. Visualization and analysis

The first visual is a bar chart that provides a clear representation of the total number of incidents in each borough. From the chart, we can observe which boroughs have the highest and lowest number of incidents.
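The chart itself doesn't survive in this text export; a minimal sketch of how it could be produced (the object and column names follow the dataset loaded above):

```r
data %>%
  count(BORO, name = "incidents") %>%
  ggplot(aes(x = reorder(BORO, -incidents), y = incidents, fill = BORO)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Total Shooting Incidents by Borough (2006-2022)",
       x = "Borough", y = "Incidents")
```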

This chart shows that Brooklyn has the most shooting incidents. Does this mean that it is the “most dangerous” of the boroughs?

Let's continue to look at the data!

Similar to the first visual, the next chart shows the number of incidents in each borough, but broken down by month, allowing us to see trends over time.
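The `summarise()` message that follows suggests the data was grouped by YearMonth and borough; a plausible sketch of the aggregation and plot:

```r
# Monthly incident counts per borough
monthly <- data %>%
  group_by(YearMonth, BORO) %>%
  summarise(incidents = n())

ggplot(monthly, aes(x = as.Date(paste0(YearMonth, "-01")),
                    y = incidents, color = BORO)) +
  geom_line() +
  labs(x = "Month", y = "Incidents", color = "Borough")
```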

## `summarise()` has grouped output by 'YearMonth'. You can override using the
## `.groups` argument.

Clearly Brooklyn leads the pack in the sheer number of shooting incidents! Looks like a good candidate for most dangerous.

Let's take a look at the distribution of incidents across the boroughs using a heatmap!
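A sketch of how such a heatmap could be built with leaflet.extras; the `boro_name` attribute is an assumption about the GeoJSON's field naming:

```r
leaflet(data) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng = ~Longitude, lat = ~Latitude,
             radius = 8, blur = 12) %>%
  addLabelOnlyMarkers(data = boroughs_centroids,
                      label = ~boro_name,  # assumed attribute name
                      labelOptions = labelOptions(noHide = TRUE,
                                                  textOnly = TRUE))
```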

Hmmm, this introduces a little doubt about the previous conclusion. We should investigate the number of incidents relative to the population! Good thing we brought that population data with us!

When the number of incidents is normalized by population the picture is quite a bit different. Suddenly the Bronx becomes the most dangerous.
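One way the normalization could be computed, assuming the cleaned population table from Step 2:

```r
# Incidents per 100k residents, by borough
per_capita <- data %>%
  count(BORO, name = "incidents") %>%
  inner_join(population_data, by = "BORO") %>%
  mutate(per_100k = incidents / Population_2020 * 100000) %>%
  arrange(desc(per_100k))
```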

Just for fun, let's see if we can use a model to predict incidents based on borough.
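The model-building code isn't shown; judging by the `summarise()` message and the glm() call in the output below, the training data was likely built per date and borough, roughly like this (the grid completion, seed, and split proportion are assumptions):

```r
# One row per date x borough; 0 when no incident occurred that day
daily <- data %>%
  group_by(OCCUR_DATE, BORO) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  complete(OCCUR_DATE, BORO, fill = list(n = 0)) %>%
  mutate(incident_occurred = as.integer(n > 0))

set.seed(123)  # seed value is an assumption
idx <- createDataPartition(daily$incident_occurred, p = 0.7, list = FALSE)
train_data <- daily[idx, ]
test_data  <- daily[-idx, ]

model <- glm(incident_occurred ~ BORO, family = binomial, data = train_data)
summary(model)
```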

## `summarise()` has grouped output by 'OCCUR_DATE'. You can override using the
## `.groups` argument.
## 
## Call:
## glm(formula = incident_occurred ~ BORO, family = binomial, data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5440  -0.9903  -0.4352   1.0433   2.1928  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.32407    0.03087   10.50   <2e-16 ***
## BOROBROOKLYN       0.50603    0.04522   11.19   <2e-16 ***
## BOROMANHATTAN     -0.95158    0.04430  -21.48   <2e-16 ***
## BOROQUEENS        -0.78152    0.04379  -17.85   <2e-16 ***
## BOROSTATEN ISLAND -2.63348    0.06117  -43.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29565  on 21731  degrees of freedom
## Residual deviance: 25297  on 21727  degrees of freedom
## AIC: 25307
## 
## Number of Fisher Scoring iterations: 4

Looking at this summary, it would seem that the likelihood of an incident occurring is strongly influenced by which borough you are in. This isn't all that surprising given the data and visuals we have seen so far. With the Bronx as the baseline (intercept), and without normalizing for population, incidents are more likely to occur in Brooklyn and less likely to occur in the other boroughs, which matches what we saw in the raw data.

Let's use a confusion matrix to look at the accuracy of our model in predicting incidents.
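A sketch of how the confusion matrix could be generated with caret; the 0.5 threshold and object names are assumptions:

```r
pred_prob  <- predict(model, newdata = test_data, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))

confusionMatrix(pred_class,
                factor(test_data$incident_occurred, levels = c(0, 1)))
```

Note that with these factor levels, caret treats 0 (no incident) as the "positive" class, which matches the output below.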

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4032 1502
##          1 1352 2427
##                                           
##                Accuracy : 0.6935          
##                  95% CI : (0.6841, 0.7029)
##     No Information Rate : 0.5781          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3685          
##                                           
##  Mcnemar's Test P-Value : 0.005286        
##                                           
##             Sensitivity : 0.7489          
##             Specificity : 0.6177          
##          Pos Pred Value : 0.7286          
##          Neg Pred Value : 0.6422          
##              Prevalence : 0.5781          
##          Detection Rate : 0.4329          
##    Detection Prevalence : 0.5942          
##       Balanced Accuracy : 0.6833          
##                                           
##        'Positive' Class : 0               
## 

Here it looks like our accuracy (the percentage of the time our model was correct) is 69.35%. So, decidedly average (like most of my grades).

Note that the 'positive' class in this confusion matrix is 0 (no incident), so the model was actually better at predicting when an incident would not occur (sensitivity, 0.75) than when one would (specificity, 0.62).

To help understand we can look at a couple visualizations.

First, this heatmap shows how well the model did: Predicted 0 / Actual 0 means the model correctly predicted that no incident occurred, while Predicted 1 / Actual 1 means it correctly predicted that an incident did occur.

The bar chart is a breakdown of the indicators mentioned earlier.

Overall the model performs ok, but has a lot of room for improvement.

Issues with the report

  1. The initial data provides only a count of incidents. This means that while one borough may appear more likely to have an incident, without more context it is hard to say for sure. It is better to examine the data normalized by population.

  2. The term "most dangerous" is a bit disingenuous. While the number of shooting incidents per 100k residents might be a good starting point for measuring danger, it's still missing context. A shooting incident could include accidental discharges or be self-inflicted, for example. Additionally, there are many other factors to consider, such as overall crime and total violent crimes. This is all to say we have an ethical duty to be careful how we communicate conclusions.

  3. It appears that the number of incidents occurring in NYC declined steadily from 2006, only to jump back to much higher counts in 2020 and come back down only slowly since. There is no way to know what caused this given the data available. Perhaps it has to do with global events or local policies; only conjectures can be made without further information. This would be an easy place to let personal bias creep into the report.

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] caret_6.0-94         lattice_0.20-45      jsonlite_1.8.7      
##  [4] sf_1.0-14            leaflet.extras_1.0.0 leaflet_2.1.2       
##  [7] lubridate_1.9.2      forcats_1.0.0        stringr_1.5.0       
## [10] dplyr_1.1.2          purrr_1.0.1          readr_2.1.4         
## [13] tidyr_1.3.0          tibble_3.2.1         ggplot2_3.4.2       
## [16] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-160         RColorBrewer_1.1-3   tools_4.2.2         
##  [4] bslib_0.5.0          utf8_1.2.3           R6_2.5.1            
##  [7] rpart_4.1.19         KernSmooth_2.23-20   DBI_1.1.3           
## [10] colorspace_2.1-0     nnet_7.3-18          withr_2.5.0         
## [13] tidyselect_1.2.0     compiler_4.2.2       cli_3.6.0           
## [16] labeling_0.4.2       sass_0.4.7           scales_1.2.1        
## [19] classInt_0.4-9       proxy_0.4-27         digest_0.6.33       
## [22] rmarkdown_2.23       pkgconfig_2.0.3      htmltools_0.5.5     
## [25] parallelly_1.36.0    highr_0.10           fastmap_1.1.1       
## [28] htmlwidgets_1.6.2    rlang_1.1.1          rstudioapi_0.15.0   
## [31] farver_2.1.1         jquerylib_0.1.4      generics_0.1.3      
## [34] crosstalk_1.2.0      ModelMetrics_1.2.2.2 magrittr_2.0.3      
## [37] s2_1.1.4             Matrix_1.5-1         Rcpp_1.0.11         
## [40] munsell_0.5.0        fansi_1.0.4          lifecycle_1.0.3     
## [43] stringi_1.7.12       pROC_1.18.4          yaml_2.3.7          
## [46] MASS_7.3-58.1        plyr_1.8.8           recipes_1.0.6       
## [49] grid_4.2.2           parallel_4.2.2       listenv_0.9.0       
## [52] splines_4.2.2        hms_1.1.3            knitr_1.43          
## [55] pillar_1.9.0         markdown_1.7         future.apply_1.11.0 
## [58] reshape2_1.4.4       codetools_0.2-18     stats4_4.2.2        
## [61] wk_0.7.3             glue_1.6.2           evaluate_0.21       
## [64] data.table_1.14.8    vctrs_0.6.3          tzdb_0.4.0          
## [67] foreach_1.5.2        gtable_0.3.3         future_1.33.0       
## [70] cachem_1.0.8         xfun_0.39            gower_1.0.1         
## [73] mime_0.12            prodlim_2023.03.31   e1071_1.7-13        
## [76] class_7.3-20         survival_3.4-0       timeDate_4022.108   
## [79] iterators_1.0.14     hardhat_1.3.0        units_0.8-3         
## [82] lava_1.7.2.1         timechange_0.2.0     globals_0.16.2      
## [85] ellipsis_0.3.2       ipred_0.9-14